Traveled speeds

The Quota for Exercise of Parliamentary Activity says that meal expenses can be reimbursed only for the politician themselves, excluding guests and assistants. Creating a feature with the speed traveled since the last meal can help us detect anomalies compared to other expenses.

Since the structured data does not contain the time of each expense, we analyze groups of expenses made on the same day.

  • Learn how to calculate distance between two coordinates.
  • Filter "Congressperson meal" expenses.
  • Order by date.
  • Merge reimbursements.xz dataset with companies.xz, so we have latitude/longitude for each expense.
  • Remove expenses with less than 12 hours between each other.

...

  • Filter specific congressperson.
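The first step — distance between two coordinates — can be sketched with the haversine formula. This is a minimal pure-Python version (the notebook itself uses geopy later); coordinates are assumed to be in decimal degrees:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = mean Earth radius

haversine_km(-15.78, -47.93, -23.55, -46.63)  # Brasília to São Paulo, ~875 km
```

Geodesic implementations such as geopy's are more accurate, but haversine is good enough to flag day trips that are physically implausible.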
In [1]:
import pandas as pd
import numpy as np

reimbursements = pd.read_csv('../data/2016-11-19-reimbursements.xz',
                             dtype={'cnpj_cpf': str},  # np.str was removed in NumPy 1.24
                             low_memory=False)
In [2]:
reimbursements.iloc[0]
Out[2]:
year                                                       2009
applicant_id                                               1001
document_id                                             1564212
reimbursement_value_total                                   NaN
total_net_value                                             130
reimbursement_numbers                                      2888
congressperson_name                            DILCEU SPERAFICO
congressperson_id                                         73768
congressperson_document                                     444
term                                                       2015
state                                                        PR
party                                                        PP
term_id                                                      55
subquota_number                                               3
subquota_description                       Fuels and lubricants
subquota_group_id                                             1
subquota_group_description                 Veículos Automotores
supplier                      MELHOR POSTO DE COMBUSTÍVEIS LTDA
cnpj_cpf                                         02989654001197
document_number                                           43059
document_type                                                 0
issue_date                                  2009-04-06T00:00:00
document_value                                              130
remark_value                                                  0
net_values                                                  130
month                                                         4
installment                                                   0
passenger                                                   NaN
leg_of_the_trip                                             NaN
batch_number                                             388810
reimbursement_values                                        NaN
Name: 0, dtype: object
In [3]:
reimbursements = reimbursements[reimbursements['subquota_description'] == 'Congressperson meal']
reimbursements.shape
Out[3]:
(191724, 31)
In [4]:
reimbursements['issue_date'] = pd.to_datetime(reimbursements['issue_date'], errors='coerce')
reimbursements.sort_values('issue_date', inplace=True)
In [5]:
companies = pd.read_csv('../data/2016-09-03-companies.xz', low_memory=False)
companies.shape
Out[5]:
(60047, 228)
In [6]:
companies.iloc[0]
Out[6]:
situation_date                                                     03/11/2005
type                                                                   MATRIZ
name                             COMPANHIA DE AGUAS E ESGOTOS DE RORAIMA CAER
phone                                                          (95) 3626-5165
situation                                                               ATIVA
neighborhood                                                        SAO PEDRO
address                                                        R MELVIN JONES
number                                                                    219
zip_code                                                           69.306-610
city                                                                BOA VISTA
state                                                                      RR
opening                                                            21/11/1969
legal_entity                              203-8 - SOCIEDADE DE ECONOMIA MISTA
trade_name                                                                NaN
cnpj                                                       05.939.467/0001-15
last_updated                                         2016-07-08T05:55:31.679Z
status                                                                     OK
additional_address_details                                                NaN
email                                                                     NaN
responsible_federative_entity                                             NaN
situation_reason                                                          NaN
special_situation                                                         NaN
special_situation_date                                                    NaN
message                                                                   NaN
main_activity_code                                                 36.00-6-01
main_activity                     Captação, tratamento e distribuição de água
secondary_activity_1                                                      NaN
secondary_activity_10                                                     NaN
secondary_activity_10_code                                                NaN
secondary_activity_11                                                     NaN
                                                     ...                     
secondary_activity_88_code                                                NaN
secondary_activity_89                                                     NaN
secondary_activity_89_code                                                NaN
secondary_activity_8_code                                                 NaN
secondary_activity_9                                                      NaN
secondary_activity_90                                                     NaN
secondary_activity_90_code                                                NaN
secondary_activity_91                                                     NaN
secondary_activity_91_code                                                NaN
secondary_activity_92                                                     NaN
secondary_activity_92_code                                                NaN
secondary_activity_93                                                     NaN
secondary_activity_93_code                                                NaN
secondary_activity_94                                                     NaN
secondary_activity_94_code                                                NaN
secondary_activity_95                                                     NaN
secondary_activity_95_code                                                NaN
secondary_activity_96                                                     NaN
secondary_activity_96_code                                                NaN
secondary_activity_97                                                     NaN
secondary_activity_97_code                                                NaN
secondary_activity_98                                                     NaN
secondary_activity_98_code                                                NaN
secondary_activity_99                                                     NaN
secondary_activity_99_code                                                NaN
secondary_activity_9_code                                                 NaN
latitude                                                              2.82788
longitude                                                            -60.6601
latitude.1                                                            2.82788
longitude.1                                                          -60.6601
Name: 0, dtype: object
In [7]:
companies['cnpj'] = companies['cnpj'].str.replace(r'[\.\/\-]', '', regex=True)
In [8]:
dataset = pd.merge(reimbursements, companies, left_on='cnpj_cpf', right_on='cnpj')
dataset.shape
Out[8]:
(176005, 259)
In [9]:
dataset.iloc[0]
Out[9]:
year                                                               2011
applicant_id                                                       2303
document_id                                                     2003049
reimbursement_value_total                                           NaN
total_net_value                                                      80
reimbursement_numbers                                              3554
congressperson_name                                       RONALDO ZULKE
congressperson_id                                                160594
congressperson_document                                             515
term                                                               2011
state_x                                                              RS
party                                                                PT
term_id                                                              54
subquota_number                                                      13
subquota_description                                Congressperson meal
subquota_group_id                                                     0
subquota_group_description                                          NaN
supplier                      BRENDT EMPREENDIMENTOS E ALIMENTAÇÃO LTDA
cnpj_cpf                                                 72614977000290
document_number                                                   38110
document_type                                                         0
issue_date                                          2001-02-01 00:00:00
document_value                                                       80
remark_value                                                          0
net_values                                                           80
month                                                                 2
installment                                                           0
passenger                                                           NaN
leg_of_the_trip                                                     NaN
batch_number                                                     519202
                                                ...                    
secondary_activity_88_code                                          NaN
secondary_activity_89                                               NaN
secondary_activity_89_code                                          NaN
secondary_activity_8_code                                           NaN
secondary_activity_9                                                NaN
secondary_activity_90                                               NaN
secondary_activity_90_code                                          NaN
secondary_activity_91                                               NaN
secondary_activity_91_code                                          NaN
secondary_activity_92                                               NaN
secondary_activity_92_code                                          NaN
secondary_activity_93                                               NaN
secondary_activity_93_code                                          NaN
secondary_activity_94                                               NaN
secondary_activity_94_code                                          NaN
secondary_activity_95                                               NaN
secondary_activity_95_code                                          NaN
secondary_activity_96                                               NaN
secondary_activity_96_code                                          NaN
secondary_activity_97                                               NaN
secondary_activity_97_code                                          NaN
secondary_activity_98                                               NaN
secondary_activity_98_code                                          NaN
secondary_activity_99                                               NaN
secondary_activity_99_code                                          NaN
secondary_activity_9_code                                           NaN
latitude                                                       -16.0209
longitude                                                      -48.0604
latitude.1                                                     -16.0209
longitude.1                                                    -48.0604
Name: 0, dtype: object

Remove party leaderships from the dataset before calculating the ranking.

In [10]:
dataset = dataset[dataset['congressperson_id'].notnull()]
dataset.shape
Out[10]:
(175071, 259)

And also remove companies mistakenly geolocated outside of Brazil.

In [11]:
is_in_brazil = (dataset['longitude'] < -34.7916667) & \
    (dataset['latitude'] < 5.2722222) & \
    (dataset['latitude'] > -33.742222) & \
    (dataset['longitude'] > -73.992222)
dataset = dataset[is_in_brazil]
dataset.shape
Out[11]:
(168568, 259)
In [12]:
# keys = ['applicant_id', 'issue_date']
keys = ['congressperson_name', 'issue_date']
aggregation = dataset.groupby(keys)['total_net_value'] \
    .agg(sum='sum', expenses='size', mean='mean')  # named aggregation; dict renaming was removed from pandas
In [13]:
aggregation['expenses'] = aggregation['expenses'].astype(int)  # np.int was removed in NumPy 1.24
In [14]:
aggregation.sort_values(['expenses', 'sum'], ascending=[False, False]).head(10)
Out[14]:
sum expenses mean
congressperson_name issue_date
CELSO MALDANER 2011-09-05 750.28 13 57.713846
SANDRA ROSADO 2012-01-12 333.40 12 27.783333
2012-01-17 287.43 12 23.952500
2012-01-06 281.75 12 23.479167
LÉO VIVAS 2010-08-31 630.00 11 57.272727
SANDRA ROSADO 2012-01-11 541.56 11 49.232727
PAULO WAGNER 2011-07-21 537.66 11 48.878182
SANDRA ROSADO 2015-01-07 396.60 11 36.054545
2012-01-15 295.58 11 26.870909
FRANCISCO DE ASSIS 2014-11-18 565.99 10 56.599000
In [15]:
len(aggregation[aggregation['expenses'] > 7])
Out[15]:
32
In [16]:
keys = ['congressperson_name', 'issue_date']
cities = dataset.groupby(keys)['city'] \
    .agg(city=lambda x: len(set(x)), city_list=lambda x: ','.join(set(x))) \
    .sort_values('city', ascending=False)
In [17]:
cities.head()
Out[17]:
city city_list
congressperson_name issue_date
TAKAYAMA 2014-06-25 6 LARANJEIRAS DO SUL,GUARAPUAVA,GUAIRA,FERNANDES...
ZECA DIRCEU 2012-02-14 5 PAICANDU,GUARULHOS,BRASILIA,PARANAVAI,MARINGA
RICARDO IZAR 2014-04-26 5 SAO PAULO,PRAIA GRANDE,BAURU,LINS,BOITUVA
PAULO FERREIRA 2013-02-08 5 IGARAPAVA,IPAMERI,LIMEIRA,BRASILIA,EMBU DAS ARTES
MARGARIDA SALOMÃO 2014-12-02 5 BARBACENA,RIO DE JANEIRO,BRASILIA,JUIZ DE FORA...
In [18]:
cities[cities['city'] >= 4].shape
Out[18]:
(127, 2)

It would be helpful for our analysis to have a new column containing the distance traveled on the given day.

In [19]:
from geopy.distance import geodesic as distance  # vincenty was removed in geopy 2.x
from IPython.display import display

x = dataset.iloc[0]
display(x[['cnpj', 'city', 'state_y']])
distance(x[['latitude', 'longitude']],
         x[['latitude', 'longitude']])
cnpj       72614977000290
city             BRASILIA
state_y                DF
Name: 0, dtype: object
Out[19]:
Distance(0.0)
In [20]:
dataset.shape
Out[20]:
(168568, 259)
In [21]:
dataset[['latitude', 'longitude']].dropna().shape
Out[21]:
(168568, 2)
In [22]:
from itertools import tee

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def calculate_distances(x):
    coordinate_list = x[['latitude', 'longitude']].values
    distance_list = [distance(*coordinates_pair).km
                     for coordinates_pair in pairwise(coordinate_list)]
    return np.nansum(distance_list)

distances = dataset.groupby(keys).apply(calculate_distances)
In [23]:
distances = distances.reset_index() \
    .rename(columns={0: 'distance_traveled'}) \
    .sort_values('distance_traveled', ascending=False)
distances.head()
Out[23]:
congressperson_name issue_date distance_traveled
112369 SANDRA ROSADO 2012-09-04 6969.643860
112201 SANDRA ROSADO 2012-01-12 6965.526389
112210 SANDRA ROSADO 2012-01-23 6833.928798
112221 SANDRA ROSADO 2012-02-08 5333.782132
112295 SANDRA ROSADO 2012-06-12 5309.248049

So far we are not ordering the list of cities, just calculating the distance between them in the order they appear in the dataset. Since we don't have the time of each expense to know their real order, one approach is to consider the shortest path between the cities visited by the congressperson on that day.
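To make the result independent of row order, one could brute-force the shortest open route over all permutations of the day's cities. A sketch, using Euclidean `math.dist` (Python ≥3.8) as a stand-in for the geographic distance used elsewhere in this notebook:

```python
from itertools import permutations
from math import dist  # Euclidean stand-in for the geographic distance

def shortest_route_length(points):
    """Brute-force the shortest open route visiting every point exactly once.

    Factorial cost, so only viable for the few cities seen in a single day.
    """
    return min(
        sum(dist(a, b) for a, b in zip(order, order[1:]))
        for order in permutations(points)
    )

shortest_route_length([(0, 0), (0, 2), (0, 1)])  # best order is the straight line: 2.0
```

With at most a handful of meals per day, the factorial blow-up is not a practical concern here.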

In [24]:
import networkx as nx

G = nx.Graph()
In [25]:
G = nx.path_graph(5)
G
Out[25]:
<networkx.classes.graph.Graph at 0x138d8bda0>
In [26]:
path = dict(nx.all_pairs_shortest_path(G))  # returns an iterator in networkx 2.x
path
path
Out[26]:
{0: {0: [0], 1: [0, 1], 2: [0, 1, 2], 3: [0, 1, 2, 3], 4: [0, 1, 2, 3, 4]},
 1: {0: [1, 0], 1: [1], 2: [1, 2], 3: [1, 2, 3], 4: [1, 2, 3, 4]},
 2: {0: [2, 1, 0], 1: [2, 1], 2: [2], 3: [2, 3], 4: [2, 3, 4]},
 3: {0: [3, 2, 1, 0], 1: [3, 2, 1], 2: [3, 2], 3: [3], 4: [3, 4]},
 4: {0: [4, 3, 2, 1, 0], 1: [4, 3, 2, 1], 2: [4, 3, 2], 3: [4, 3], 4: [4]}}
In [27]:
path[0][4]
Out[27]:
[0, 1, 2, 3, 4]
In [28]:
random_congressperson_day = cities[cities['city'] == 3].sample(random_state=0).reset_index().iloc[0]
matching_keys = ['congressperson_name', 'issue_date']
matches = \
    (dataset['congressperson_name'] == random_congressperson_day['congressperson_name']) & \
    (dataset['issue_date'] == random_congressperson_day['issue_date'])
expenses_for_graph = dataset[matches]
expenses_for_graph
Out[28]:
year applicant_id document_id reimbursement_value_total total_net_value reimbursement_numbers congressperson_name congressperson_id congressperson_document term ... secondary_activity_97_code secondary_activity_98 secondary_activity_98_code secondary_activity_99 secondary_activity_99_code secondary_activity_9_code latitude longitude latitude.1 longitude.1
69171 2013 2816 5197567 NaN 22.2 4382 NILMÁRIO MIRANDA 74751.0 542.0 2015.0 ... NaN NaN NaN NaN NaN NaN -15.784347 -47.913276 -15.784347 -47.913276
112211 2013 2816 5202209 NaN 50.0 4385 NILMÁRIO MIRANDA 74751.0 542.0 2015.0 ... NaN NaN NaN NaN NaN NaN -18.365815 -44.458726 -18.365815 -44.458726
135219 2013 2816 5202073 NaN 55.9 4402 NILMÁRIO MIRANDA 74751.0 542.0 2015.0 ... NaN NaN NaN NaN NaN NaN -16.723344 -43.871767 -16.723344 -43.871767

3 rows × 259 columns

In [29]:
def city_and_state(row):
    return '{} - {}'.format(row['city'], row['state_y'])

expenses_for_graph['city_state'] = expenses_for_graph.apply(city_and_state, axis=1)
expenses_for_graph['city_state']
/Users/irio/anaconda3/envs/serenata_de_amor/lib/python3.5/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Out[29]:
69171          BRASILIA - DF
112211          CORINTO - MG
135219    MONTES CLAROS - MG
Name: city_state, dtype: object
In [30]:
lat_longs = expenses_for_graph[['city_state', 'latitude', 'longitude']].values
# np.apply_along_axis(lambda x: (x[0], x[1]), axis=1, arr=lat_longs)
  • Create a node for each of the cities.
  • Connect each city with every other (making it a "complete graph").
  • Give weight to each of the edges, which should correspond to the distance between the cities.
  • Create a new node (artificial origin/destination for the Traveling Salesman).
  • Connect this new node with every other node, with weight equal to zero.
  • Run the Traveling Salesman algorithm starting from the artificial node.
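The steps above can be sketched end-to-end on toy data. Everything below is illustrative: the city names and planar coordinates are made up, and a brute force over permutations stands in for a proper Traveling Salesman solver (fine for the handful of cities visited in a day):

```python
import networkx as nx
from itertools import combinations, permutations

# Hypothetical cities with planar coordinates, for illustration only
coords = {'A': (0.0, 0.0), 'B': (3.0, 4.0), 'C': (6.0, 0.0)}

g = nx.Graph()
for name, pos in coords.items():      # one node per city
    g.add_node(name, pos=pos)
for u, v in combinations(coords, 2):  # connect every pair: a complete graph
    (x1, y1), (x2, y2) = coords[u], coords[v]
    g.add_edge(u, v, weight=((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5)

g.add_node('origin')                  # artificial origin/destination
g.add_edges_from((('origin', n) for n in coords), weight=0)

def tsp_from(graph, start):
    """Brute-force the cheapest tour starting and ending at `start`."""
    others = [n for n in graph if n != start]
    best_length, best_tour = float('inf'), None
    for order in permutations(others):
        tour = (start, *order, start)
        length = sum(graph[a][b]['weight'] for a, b in zip(tour, tour[1:]))
        if length < best_length:
            best_length, best_tour = length, tour
    return best_tour, best_length

tsp_from(g, 'origin')
```

Because the artificial node's edges all weigh zero, the cheapest cycle through it corresponds exactly to the shortest open path through the real cities.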
In [31]:
from itertools import combinations

list(combinations(lat_longs.tolist(), 2))
Out[31]:
[(['BRASILIA - DF', -15.7843468, -47.9132763],
  ['CORINTO - MG', -18.3658152, -44.4587258]),
 (['BRASILIA - DF', -15.7843468, -47.9132763],
  ['MONTES CLAROS - MG', -16.723344400000002, -43.8717668]),
 (['CORINTO - MG', -18.3658152, -44.4587258],
  ['MONTES CLAROS - MG', -16.723344400000002, -43.8717668])]
In [32]:
def create_node(row):
    print(row[0], row[1], row[2])
    cities_graph.add_node(row[0], pos=(row[1], row[2]))
    return 42  # np.apply_along_axis requires a return value; this one is discarded

cities_graph = nx.Graph()
np.apply_along_axis(create_node, axis=1, arr=lat_longs)

edges = list(combinations(lat_longs.tolist(), 2))
for edge in edges:
    weight = distance(edge[0][1:], edge[1][1:]).km
    print(edge[0][0], edge[1][0], weight)
    cities_graph.add_edge(edge[0][0], edge[1][0], weight=weight)
BRASILIA - DF -15.7843468 -47.9132763
CORINTO - MG -18.3658152 -44.4587258
MONTES CLAROS - MG -16.723344400000002 -43.8717668
BRASILIA - DF CORINTO - MG 465.6183735694923
BRASILIA - DF MONTES CLAROS - MG 444.33789522900355
CORINTO - MG MONTES CLAROS - MG 192.16579440591534
In [33]:
# cities_graph.add_node('starting_point')
# new_edges = [('starting_point', node) for node in cities_graph.nodes()]
# cities_graph.add_edges_from(new_edges, weight=0)
In [34]:
cities_graph.nodes()
Out[34]:
['BRASILIA - DF', 'CORINTO - MG', 'MONTES CLAROS - MG']
In [35]:
cities_graph.edges()
Out[35]:
[('BRASILIA - DF', 'CORINTO - MG'),
 ('BRASILIA - DF', 'MONTES CLAROS - MG'),
 ('CORINTO - MG', 'MONTES CLAROS - MG')]
  1. We trust the Gist.
  2. We review the Gist.
  3. We simply forget about "minimum distance" and sum all the distances of the complete graph.
In [36]:
def hamilton(G):
    F = [(G, [list(G.nodes())[0]])]  # G.nodes() is not indexable in networkx 2.x
    n = G.number_of_nodes()
    while F:
        graph,path = F.pop()
        confs = []
        for node in graph.neighbors(path[-1]):
            conf_p = path[:]
            conf_p.append(node)
            conf_g = nx.Graph(graph)
            conf_g.remove_node(path[-1])
            confs.append((conf_g,conf_p))
        for g,p in confs:
            if len(p)==n:
                return p
            else:
                F.append((g,p))
    return None

hamilton(cities_graph)
Out[36]:
['BRASILIA - DF', 'MONTES CLAROS - MG', 'CORINTO - MG']
In [37]:
# print(lat_longs)
edges = list(combinations(lat_longs.tolist(), 2))
np.sum([distance(edge[0][1:], edge[1][1:]).km for edge in edges])
Out[37]:
1102.122063204411
In [38]:
def calculate_sum_distances(x):
    coordinate_list = x[['latitude', 'longitude']].values
    edges = list(combinations(coordinate_list, 2))
    return np.sum([distance(edge[0][1:], edge[1][1:]).km for edge in edges])

distances = dataset.groupby(keys).apply(calculate_sum_distances)
In [39]:
distances = distances.reset_index() \
    .rename(columns={0: 'distance_traveled'}) \
    .sort_values('distance_traveled', ascending=False)
distances.head()
Out[39]:
congressperson_name issue_date distance_traveled
112206 SANDRA ROSADO 2012-01-17 37969.076510
112200 SANDRA ROSADO 2012-01-11 30579.442632
112201 SANDRA ROSADO 2012-01-12 27617.903746
112196 SANDRA ROSADO 2012-01-06 23571.629213
112204 SANDRA ROSADO 2012-01-15 21144.918842
In [40]:
dataset_with_distances = \
    pd.merge(aggregation.reset_index(),
             distances,
             left_on=keys,
             right_on=keys)
dataset_with_distances.sort_values(['distance_traveled', 'expenses'], ascending=[False, False]).head(10)
Out[40]:
congressperson_name issue_date sum expenses mean distance_traveled
112206 SANDRA ROSADO 2012-01-17 287.43 12 23.952500 37969.076510
112200 SANDRA ROSADO 2012-01-11 541.56 11 49.232727 30579.442632
112201 SANDRA ROSADO 2012-01-12 333.40 12 27.783333 27617.903746
112196 SANDRA ROSADO 2012-01-06 281.75 12 23.479167 23571.629213
112204 SANDRA ROSADO 2012-01-15 295.58 11 26.870909 21144.918842
112205 SANDRA ROSADO 2012-01-16 182.14 8 22.767500 14072.247943
112210 SANDRA ROSADO 2012-01-23 139.42 7 19.917143 13185.187380
112198 SANDRA ROSADO 2012-01-09 340.18 10 34.018000 12717.700035
112248 SANDRA ROSADO 2012-03-28 225.73 6 37.621667 11324.130379
112221 SANDRA ROSADO 2012-02-08 309.46 6 51.576667 11320.374327
In [41]:
from altair import Chart

Chart(dataset_with_distances).mark_point().encode(
    x='expenses',
    y='distance_traveled',
)
In [42]:
dataset_with_distances.describe()
Out[42]:
sum expenses mean distance_traveled
count 130675.000000 130675.000000 130675.000000 130675.000000
mean 82.658007 1.289979 66.411733 40.001456
std 95.133361 0.632451 79.008626 320.772729
min 0.010000 1.000000 0.010000 0.000000
25% 33.300000 1.000000 29.400000 0.000000
50% 57.900000 1.000000 49.510000 0.000000
75% 106.150000 1.000000 86.900000 0.000000
max 6010.000000 13.000000 5852.000000 37969.076510
In [43]:
dataset_with_distances[dataset_with_distances['expenses'] > 4].shape
Out[43]:
(490, 6)
In [44]:
expenses_ceiling = dataset_with_distances['expenses'].mean() + \
    (3 * dataset_with_distances['expenses'].std())
expenses_ceiling
Out[44]:
3.1873315296945344
In [45]:
distance_traveled_ceiling = dataset_with_distances['distance_traveled'].mean() + \
    (3 * dataset_with_distances['distance_traveled'].std())
distance_traveled_ceiling
Out[45]:
1002.3196415922741
In [46]:
is_anomaly = (dataset_with_distances['expenses'] > expenses_ceiling) & \
    (dataset_with_distances['distance_traveled'] > distance_traveled_ceiling)
dataset_with_distances[is_anomaly].shape
Out[46]:
(471, 6)
In [47]:
dataset_with_distances.loc[is_anomaly].sum()
Out[47]:
sum                  9.688267e+04
expenses             2.218000e+03
mean                 2.078132e+04
distance_traveled    1.470252e+06
dtype: float64
In [48]:
dataset_with_distances.loc[is_anomaly, 'mean'].mean()
Out[48]:
44.121700380770442
In [49]:
len(dataset_with_distances.loc[is_anomaly]) / len(dataset_with_distances)
Out[49]:
0.003604361966711307
In [50]:
dataset_with_distances[is_anomaly] \
    .groupby('congressperson_name')['expenses'].count().reset_index() \
    .rename(columns={'expenses': 'abnormal_days'}) \
    .sort_values('abnormal_days', ascending=False) \
    .head(10)
Out[50]:
congressperson_name abnormal_days
20 DR. ADILSON SOARES 106
71 SANDRA ROSADO 58
31 FRANCISCO FLORIANO 39
80 ZECA DIRCEU 36
66 RENATO MOLLING 34
30 FRANCISCO DE ASSIS 24
74 TAKAYAMA 13
39 JORGE SOLLA 10
76 VANDERLEI MACRIS 10
34 GERALDO THADEU 8
In [ ]: